Abstract

A system for private stream searching, introduced by Ostrovsky and Skeith [18], allows a client to
provide an untrusted server with an encrypted search query. The server uses the query on a stream
of documents and returns the matching documents to the client while learning nothing about the
nature of the query. We present a new scheme for conducting private keyword search on streaming
data which requires $O(m)$ server-to-client communication complexity to return the content
of the matching documents, where $m$ is the size of the documents. The required storage on the
server conducting the search is also $O(m)$. Our technique requires some metadata to be returned
in addition to the documents; for this we present a scheme with $O(m \log(t/m))$ communication
and storage complexity, where $t$ is the number of files processed in the stream. In many streaming
applications, the number of matching documents is expected to be a fixed fraction of the stream
length; in this case the new scheme has the optimal $O(m)$ overall communication and storage
complexity with near-optimal constant factors. The previous best scheme for private stream
searching was shown to have $O(m \log m)$ communication and storage complexity. In applications
where $t/m > m$, we may revert to an alternative method of returning the necessary metadata which
has $O(m \log m)$ communication and storage complexity; in this case constant factor improvements
over the previous scheme are achieved. Our solution employs a novel construction in which the user
reconstructs the matching files by solving a system of linear equations. This allows the matching
documents to be stored in a compact buffer rather than relying on redundancies to avoid collisions
in the storage buffer as in previous work. We also present a unique encrypted Bloom filter
construction which is used to encode the set of matching documents. In this paper we describe our
scheme, prove it secure, analyze its asymptotic performance, and describe several extensions.


1  Introduction 


The Internet currently has several different types of sources of information. These include
conventional websites, time sensitive web pages such as news articles and blog posts, real time
public discussions through channels such as IRC, newsgroup posts, online auctions, and web based
forums or classified ads. One common link between all of these sources is that searching mechanisms
are vital for a user to be able to distill the information relevant to him.

Most  search  mechanisms  involve  a  client  sending  a  set  of  search  criteria  to  a  server  and  the 
server  performing  the  search  over  some  large  data  set.  However,  for  some  applications  a  client 
would  like  to  hide  his  search  criteria,  i.e.,  which  type  of  data  he  is  interested  in.  A  client  might 
want  to  protect  the  privacy  of  his  search  queries  for  a  variety  of  reasons  ranging  from  personal 
privacy  to  protection  of  commercial  interests. 

A  naive  method  for  allowing  private  searches  is  to  download  the  entire  resource  to  the  client 
machine  and  perform  the  search  locally.  This  is  typically  infeasible  due  to  the  large  size  of  the  data 
to  be  searched,  the  limited  bandwidth  between  the  client  and  a  remote  entity,  or  to  the  unwillingness 
of  a  remote  entity  to  disclose  the  entire  resource  to  the  client. 

In  many  scenarios  the  documents  to  be  searched  are  being  continually  generated  and  are  already 
being  processed  as  a  stream  by  remote  servers.  In  this  case  it  would  be  advantageous  to  allow 
clients  to  establish  persistent  searches  with  the  servers  where  they  could  be  efficiently  processed. 
Content matching the searches could then be returned to the clients as it arises. For example,
the Google News Alerts system [1] emails users whenever web news articles crawled by Google match
their  registered  search  keywords.  In  this  paper  we  develop  an  efficient  cryptographic  system  which 
allows  services  of  this  type  while  provably  maintaining  the  secrecy  of  the  search  criteria. 

Private Stream Searching  Recently, Ostrovsky and Skeith defined the problem of “private
filtering”, which models the situations described above. They gave a scheme based on the
homomorphism of the Paillier cryptosystem [19, 9] providing this capability [18]. First, a public
dictionary of keywords $D$ is fixed. To construct a query for the disjunction of some keywords
$K \subseteq D$, the user produces an array of ciphertexts, one for each $w \in D$. If $w \in K$, a
one is encrypted; otherwise a zero is encrypted. A server processing a document in its stream may
then compute the product of the query array entries corresponding to the keywords found in the
document. This will result in the encryption of some value $c$, which, by the homomorphism, is
non-zero if and only if the document matches the query. The server may then in turn compute
$E(c)^f = E(cf)$, where $f$ is the content of the document, obtaining either an encryption of
(a multiple of) the document or an encryption of zero.

Ostrovsky and Skeith propose the server keep a large array of ciphertexts as a buffer to
accumulate matching documents; each $E(cf)$ value is multiplied into a number of random locations
in the buffer. If the document matches the query then $c$ is non-zero and copies of that document
will be placed into these random locations; otherwise, $c = 0$ and this step will add an encryption
of 0 to each location, having no effect on the corresponding plaintexts. A fundamental property of
their solution is that if two different matching documents are ever added to the same buffer
location then we will have a collision and both copies will be lost. If all copies of a particular
matching document are lost due to collisions then that document is lost, and when the buffer is
returned to the client, he will not be able to recover it.

To avoid the loss of data in this approach one must make the buffer sufficiently large so that this
event does not happen. This requires that the buffer be much larger than the expected number of
required documents. In particular, Ostrovsky and Skeith show that a given probability of
successfully obtaining all matching documents may be obtained with a buffer of size
$O(m \log m)$,^1 where $m$ is the number of matching documents. While effective, this scheme
results in inefficiency due to the fact that a significant portion of the buffer returned to the
user consists of empty locations and document collisions.

Our Approach  In this paper we present a new private stream searching scheme which achieves
the optimal $O(m)$ communication from the server to the client and server storage overhead in
returning the content of the matching documents, given any fixed probability of successfully
retrieving all matching documents. Metadata required for the reconstruction of the documents is
returned using a technique requiring $O(m \log(t/m))$ communication and storage. The latter
technique also results in the optimal $O(m)$ complexity with near-optimal constant factors in
applications where each document matches the query with some probability, independent of the other
documents. In applications where a fixed number of documents are expected to match, regardless of
the stream length, a modification to our scheme produces the previous $O(m \log m)$ complexity, but
with near-optimal constant factor overhead. These results are based on the novel combination of a
few techniques.

Like  the  approach  of  Ostrovsky  and  Skeith  we  give  an  encrypted  dictionary  and  non-matching 
documents  have  no  effect  on  the  encrypted  contents.  However,  rather  than  using  a  large  buffer  and 
attempting  to  avoid  collisions,  each  matching  document  in  our  system  is  copied  randomly  over 
approximately half of the locations across the buffer. A pseudo-random function, $g$, whose key is
shared by the client and server, will determine pseudo-randomly with probability $1/2$ whether the
document is copied into a given location, where the function takes as inputs the document number
(document number $i$ is the $i$th document seen by the server) and buffer location. While any one
particular  buffer  location  will  not  likely  contain  any  information  about  any  matching  document, 
with  high  probability  all  the  information  from  all  the  matching  documents  can  be  retrieved  from  the 
whole  system  by  the  client  given  that  the  client  knows  the  number  of  matching  documents  and  that 
the  number  of  matching  documents  is  less  than  the  buffer  size.  The  client  can  do  this  by  decrypting 
the  buffer  and  then  solving  a  linear  system  to  retrieve  the  original  documents.  Finally,  the  server 
maintains  a  separate  encrypted  Bloom  filter  that  efficiently  keeps  track  of  which  document  numbers 
were  matched.  The  use  of  an  efficient  Bloom  filter  to  keep  track  along  with  our  method  of  storing 
documents  allows  us  to  store  the  encrypted  documents  in  a  much  smaller  buffer. 

1 Specifically, they define a correctness parameter $\gamma$ and use a buffer of size $O(\gamma m)$.
They show that a given success probability may be achieved with a $\gamma$ that is $O(\log m)$.
1.1  Related Work


Private searching may be viewed as the flip side of searching on encrypted data [21, 3, 11]; in this
case  the  data  is  unencrypted  and  the  query  is  encrypted.  Goh  applied  Bloom  filters  in  a  way  that 
allows  a  server  to  store  encrypted-searchable  data  in  a  more  efficient  manner. 

However,  searching  on  encrypted  data  is  quite  different  from  private  searching.  In  the  problem 
of  searching  on  encrypted  data  the  data  is  hidden  from  the  server,  while  in  private  searching  the 
data  is  known  to  the  server  and  the  client’s  queries  must  remain  hidden.  Private  searching  is 
actually  most  closely  related  to  the  topics  of  single-database  private  information  retrieval  [8,  15,  5, 
6]  and  oblivious  transfer  [17,  16].  One  incompatibility  between  previously  proposed  PIR  schemes 
and  the  present  problem  is  that  PIR  schemes  have  thus  far  required  communication  dependent  on 
the  size  of  the  entire  database  rather  than  the  size  of  the  portion  retrieved.  In  some  streaming 
settings,  a  private  searching  scheme  with  communications  independent  of  the  size  of  the  stream 
or  database  is  desirable.  Another  difference  between  the  PIR  and  private  search  settings  is  that 
most  PIR  constructions  model  the  database  to  be  searched  as  a  long  bitstring  and  the  queries  as 
indices  of  bits  to  be  retrieved.  In  contrast,  the  system  proposed  in  this  paper  and  that  of  Ostrovsky 
and  Skeith  allow  queries  based  on  a  search  for  keywords  within  text.  Both  these  schemes  may 
also  retrieve  pieces  of  data  by  index,  however.  The  text  associated  with  a  block  of  data  in  the 
database  against  which  queries  are  matched  is  arbitrary,  so  by  simply  including  strings  of  the  form 
“blocknumber:1”, “blocknumber:2”, ... in the text associated with each block of data, they may
be  explicitly  retrieved  by  appropriate  queries.  There  has  been  some  consideration  of  search  or 
retrieval  by  keyword  rather  than  index  in  the  PIR  literature  [7,  14,  10],  but  none  of  these  systems 
has  communication  dependent  only  on  the  size  of  the  data  retrieved  rather  than  some  function  of 
the  length  of  the  database  or  stream.  In  [2]  we  experimentally  analyzed  the  performance  of  our 
system  in  the  setting  of  a  realistic  application,  comparing  it  with  the  scheme  of  Ostrovsky  and 
Skeith. 


2  Definitions  and  Preliminaries 

In  this  section  we  describe  the  problem  of  private  searching  and  make  appropriate  definitions.  We 
also  briefly  review  Paillier’s  cryptosystem  and  the  definition  of  a  pseudo-random  function  family. 

2.1  Problem  Definition 

In  a  private  searching  scheme  a  client  will  create  an  encrypted  query  for  the  set  of  keywords  that  he 
is  interested  in.  The  client  will  give  this  encrypted  query  to  the  server.  The  server  will  then  run  a 
search algorithm on a stream of files^2 while keeping an encrypted buffer storing information about
files  for  which  there  is  a  keyword  match.  The  encrypted  buffer  will  then  be  returned  to  the  client 
(periodically)  to  enable  the  client  to  reconstruct  the  files  that  have  matched  his  query  keywords. 
We  call  a  file  a  matching  file  if  it  matches  at  least  one  keyword  in  the  set  of  keywords  that  the 

2 We use the name “file” as a general term for the data chunk that is to be returned. The type of data will vary by
application. 
client is interested in. The key aspect of a private searching scheme is that a server is capable of
conducting the search even though it does not know which set of keywords the client is interested
in. We now formally describe a private stream search scheme. Such a scheme consists of the
following three algorithms.

QueryConstruction($\lambda$, $\epsilon$, $m$, $K$)  The QueryConstruction algorithm is run by a
client to prepare an encrypted list of keywords that he would like the server to search for. The
algorithm takes as input a security parameter $\lambda$, a correctness parameter $\epsilon$, an
upper bound on the number of files to retrieve $m$, and an unencrypted set of strings $K$ that are
to be used as the search keywords. The algorithm outputs a public key $K_{pub}$, a private key
$K_{priv}$, and an encrypted query $Q$. The client then sends $K_{pub}$, $Q$ to the server. The
correctness parameter $\epsilon$ may be used to select various algorithm parameters to ensure that
up to $m$ files will be correctly retrieved with high probability. These additional parameters are
also sent to the server.

StreamSearch($K_{pub}$, $Q$, $f_1, \ldots, f_t$, $W_1, \ldots, W_t$)  The StreamSearch algorithm is
run by a server to perform a private keyword search on behalf of the client on a stream of files.
The algorithm takes as input an encrypted query $Q$, a public key $K_{pub}$, and a stream of files
$f = (f_1, f_2, \ldots, f_t)$ and corresponding sets of keywords that describe each file
$W = (W_1, \ldots, W_t)$. Normally each set $W_i$ is derived from the corresponding file $f_i$ as a
preprocessing step. The algorithm produces a buffer of encrypted results $R$ which is sent back to
the client after processing some number of files $t$, which is chosen by the client. The choice of
$t$ is application dependent and should ensure that no more than $m$ matching documents are likely
to be found.

FileReconstruction($K_{priv}$, $R$)  The FileReconstruction algorithm is used to extract the set
of matching files from the returned encrypted buffer. The algorithm FileReconstruction takes as
input the private key $K_{priv}$ and a buffer of encrypted results $R$. It outputs the set of
matching files $\{ f_i \mid |K \cap W_i| > 0 \}$.

To define privacy for a private stream search scheme, consider the following game between
a challenger and an adversary. The adversary gives the challenger two sets of keyword strings
$K_0$, $K_1$. The challenger then flips a coin $\beta$, runs
QueryConstruction($\lambda$, $\epsilon$, $m$, $K_\beta$), and gives the public key and the
encrypted query $Q$ to the adversary. The adversary then outputs a guess $\beta'$. We say that an
adversary has advantage $\epsilon$ if $|\Pr(\beta = \beta') - 1/2| > \epsilon$.

Definition 1. We say that a private searching scheme is semantically secure if for all PPT
adversaries $\mathcal{A}$, the advantage of $\mathcal{A}$ is negligible in the security parameter,
$\lambda$.

We  establish  that  the  proposed  system  satisfies  this  definition  in  Section  4.2. 

2.2  Preliminaries 

Paillier’s Cryptosystem  We now provide a brief review of the most important features of the
Paillier cryptosystem. The Paillier cryptosystem is a public key cryptosystem; as in RSA the
public key $n$ is the product of two large primes. The factorization of $n$ is the private key. In
this paper the encryption of a plaintext $m$ with the public key (there is only one public key in
use in this paper, the one generated by the client when constructing a private search) is denoted
$E(m)$, and the decryption of a ciphertext $c$ with the private key is denoted $D(c)$. Plaintexts
are represented by elements of the group $\mathbb{Z}_n$ and ciphertexts are represented by elements
of the group $\mathbb{Z}^*_{n^2}$, so $E : \mathbb{Z}_n \to \mathbb{Z}^*_{n^2}$ and
$D : \mathbb{Z}^*_{n^2} \to \mathbb{Z}_n$. Note that ciphertexts are twice as large as plaintexts.^3

The key property of the Paillier cryptosystem upon which the entire system is based is its
homomorphism. For any $a, b \in \mathbb{Z}_n$, it is the case that $D(E(a) \cdot E(b)) = a + b$.
That is, multiplying ciphertexts has the effect of adding the corresponding plaintexts. This allows
one to perform rudimentary computations on encrypted values. Our construction may be adapted to use
any public key, homomorphic cryptosystem, but for concreteness, we assume the use of the Paillier
cryptosystem throughout the rest of the paper.
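
As an illustration of the homomorphism, the following is a minimal Python sketch of a toy Paillier instance. It is not part of the scheme as specified in this paper: the function names `keygen`, `encrypt`, and `decrypt`, the choice of generator $1 + n$, and the tiny fixed primes are all assumptions made for the example, and the parameters are far too small to be secure. The sketch shows that multiplying ciphertexts adds plaintexts and that raising a ciphertext to a power $f$ multiplies its plaintext by $f$.

```python
import math
import random

def keygen(p=1789, q=1861):
    """Toy key generation from two small fixed primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1), kept private
    return n, lam

def encrypt(n, m):
    """E(m) = (1 + n)^m * r^n mod n^2 for a fresh random r coprime to n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, lam, c):
    """D(c) = L(c^lam mod n^2) * lam^(-1) mod n, where L(u) = (u - 1) // n."""
    u = pow(c, lam, n * n)
    return ((u - 1) // n) * pow(lam, -1, n) % n

n, lam = keygen()
n2 = n * n
# Multiplying ciphertexts adds the plaintexts.
sum_plain = decrypt(n, lam, encrypt(n, 1234) * encrypt(n, 4321) % n2)
# Exponentiating a ciphertext by f multiplies the plaintext by f.
scaled_plain = decrypt(n, lam, pow(encrypt(n, 3), 100, n2))
# An encryption of zero stays an encryption of zero under exponentiation.
zero_plain = decrypt(n, lam, pow(encrypt(n, 0), 100, n2))
```

The last two lines correspond to the $E(c)^f = E(cf)$ operation used by the server: a non-matching file ($c = 0$) contributes only an encryption of zero.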

Pseudo-Random Functions  In our construction we use a pseudo-random function family
$G : \mathcal{K}_G \times \mathbb{Z} \times \mathbb{Z} \to \{0, 1\}$. Roughly speaking, $G$ will
take in a key $k$ and two integers and output a pseudo-random bit. We let $g = G_k$ where
$k \xleftarrow{R} \mathcal{K}_G$.

The security of a pseudo-random function family
$G : \mathcal{K}_G \times \mathbb{Z} \times \mathbb{Z} \to \{0, 1\}$ is defined by the following
game between a challenger and an adversary $\mathcal{A}$. A challenger chooses a random key
$k \xleftarrow{R} \mathcal{K}_G$ and lets $g = G_k$. The challenger then flips a binary coin
$\beta$. At this point the adversary is allowed to make oracle queries to the challenger over the
domain. If $\beta = 0$ the challenger will respond by evaluating the function $g$ on the input,
whereas if $\beta = 1$ it will respond with a random bit to all new queries, while giving the same
response if the same query is asked twice. Finally, the adversary outputs a guess $\beta'$. We
define the adversary’s advantage in this game as:

$$\mathrm{Adv}_{\mathcal{A}} = \left| \Pr[\beta = \beta'] - 1/2 \right|$$

We say that a pseudo-random function is $(\omega_t, \omega_q, \epsilon)$-secure if no
$\omega_t$-time adversary that makes at most $\omega_q$ oracle queries has advantage greater than
$\epsilon$.
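
One plausible way to instantiate a keyed function $g$ of this shape is sketched below in Python. The paper does not prescribe a concrete family $G$; using HMAC-SHA256 and taking one output bit, the 8-byte encoding of the inputs, and the helper name `make_g` are all assumptions made for illustration.

```python
import hashlib
import hmac

def make_g(key: bytes):
    """Return g(i, j) -> {0, 1}: one bit of HMAC-SHA256 over the encoded pair (i, j)."""
    def g(i: int, j: int) -> int:
        # Fixed-width encoding so that distinct (i, j) pairs give distinct messages.
        msg = i.to_bytes(8, "big") + j.to_bytes(8, "big")
        digest = hmac.new(key, msg, hashlib.sha256).digest()
        return digest[0] & 1          # keep the low bit of the first byte
    return g

g = make_g(b"example key shared by client and server")
row = [g(1, j) for j in range(1, 9)]  # which of 8 buffer slots file 1 is copied into
```

Because `g` is deterministic given the key, the client can later recompute exactly the same bits that the server used.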


3  New  Construction 

We now describe the algorithms of the new private search scheme and give an analysis of complexity
and security properties. In the following explanations, we defer discussion of several special
failure cases to the next subsection.

3 This property of inflating messages by encrypting them is improved in the Damgård-Jurik
generalization of the Paillier cryptosystem [9]. In their scheme the plaintext and ciphertext spaces
are $\mathbb{Z}_{n^s}$ and $\mathbb{Z}^*_{n^{s+1}}$ for any $s \in \{1, 2, \ldots\}$.
However, the constraints in this paper are likely to make the original situation of $s = 1$ preferable.
Algorithm: QueryConstruction

Input: Set of keywords K.
Output: Query array Q = (E(q_1), E(q_2), ..., E(q_|D|)), public key n.

Generate a Paillier key pair n, K_priv.
for i := 1, 2, ..., |D| :
    if w_i ∈ K :
        q_i := 1
    else :
        q_i := 0
    Q[i] := E(q_i)

Figure 1: The algorithm for setting up an encrypted query.


3.1  Client’s  QueryConstruction  Procedure 

Figure 1 gives the algorithm for producing the encrypted query, QueryConstruction. A public
dictionary of potential keywords

$$D = \{w_1, w_2, \ldots, w_{|D|}\}$$

is assumed to be available. Constructing the encrypted query for some disjunction of keywords
$K \subseteq D$ then proceeds as in the scheme of Ostrovsky and Skeith. The client generates a key
pair, then for each $i \in \{1, \ldots, |D|\}$, defines $q_i = 1$ if $w_i \in K$ and $q_i = 0$ if
$w_i \notin K$. The values $q_1, q_2, \ldots, q_{|D|}$ are encrypted (rerandomizing each
encryption) and put in the array $Q = (E(q_1), E(q_2), \ldots, E(q_{|D|}))$, which forms the final
encrypted query. In Section 5.2 we give an alternative form for the encrypted queries which
eliminates the public dictionary $D$. The client then sends $Q$ and the public key $n$ to the
server.
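
The query construction can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the Paillier helpers use small fixed primes and the generator $1 + n$, and the names `paillier_keygen`, `encrypt`, `decrypt`, and `query_construction` are hypothetical.

```python
import math
import random

def paillier_keygen(p=1789, q=1861):
    """Toy Paillier keys from small fixed primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, lam

def encrypt(n, m):
    """Randomized encryption: a fresh r is drawn for every ciphertext."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(1 + n, m, n2) * pow(r, n, n2) % n2

def decrypt(n, lam, c):
    u = pow(c, lam, n * n)
    return (u - 1) // n * pow(lam, -1, n) % n

def query_construction(dictionary, keywords):
    """Encrypt q_i = 1 for dictionary words in K and q_i = 0 otherwise."""
    n, k_priv = paillier_keygen()
    query = [encrypt(n, 1 if w in keywords else 0) for w in dictionary]
    return n, k_priv, query

dictionary = ["alpha", "beta", "gamma", "delta"]
n, k_priv, Q = query_construction(dictionary, {"beta", "delta"})
recovered_bits = [decrypt(n, k_priv, c) for c in Q]  # only the key holder can do this
```

The server sees only `n` and `Q`; since every entry is a freshly randomized ciphertext, the positions holding ones are not apparent from the array itself.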

3.2  Server’s  StreamSearch  Procedure 

Figure 2 gives the full algorithm run by the server, StreamSearch. In addition to the public key
and $Q$, the client may provide the server with the parameter $t$, the number of files to process
before returning the results, and the parameters $\ell_F$, $\ell_I$, and $k$, which affect
correctness and performance (see below and Section 4.1).

State  The server must maintain three buffers as it processes the files in its stream. These buffers
are hereafter referred to as the data buffer, the c-buffer, and the matching-indices buffer and
denoted $F$, $C$, and $I$ respectively. Each of these is an array of elements from the ciphertext
space $\mathbb{Z}^*_{n^2}$, with $F$ and $C$ of length $\ell_F$ and $I$ of length $\ell_I$. For
simplified notation here and in subsequent explanations, we assume that each document is at most
n bits and therefore fits within a single plaintext in $\mathbb{Z}_n$. For longer documents
requiring $s$ elements of $\mathbb{Z}_n$, we would let $F$ be an $\ell_F \times s$ array, and
subsequent operations in which a file updates $F$ would be performed blockwise.
Algorithm: StreamSearch

Input: Q, n, number of files to process t, sequence of files f_1, ..., f_t
with corresponding keyword sets W_1, ..., W_t, size of data buffer
ℓ_F, size of matching-indices buffer ℓ_I, number of hash functions k
Output: Data buffer F, coefficients buffer C, matching-indices buffer I.

Initialize F and C as ℓ_F element arrays and I as an ℓ_I element array of
members of Z*_{n^2}. Initialize each element of F, C, and I to E(0).

for i := 1, 2, ..., t :
    c := E(0)
    for w_j ∈ W_i :
        c := c · Q[j] mod n^2
    e := c^{f_i} mod n^2
    for j := 1, 2, ..., ℓ_F :
        if g(i, j) = 1 :
            F[j] := F[j] · e mod n^2
            C[j] := C[j] · c mod n^2
    for j := 1, 2, ..., k :
        ℓ := h_j(i) mod ℓ_I
        I[ℓ] := I[ℓ] · c mod n^2

Figure 2: The algorithm for running the private search.


The  data  buffer  will  store  the  matching  files  in  an  encrypted  form  which  can  then  be  used  by 
the  client  to  reconstruct  the  matching  files.  In  particular,  the  data  buffer  will  contain  a  system  of 
linear  equations  in  terms  of  the  content  of  the  matching  files  in  an  encrypted  form.  This  system  of 
equations  will  later  be  solved  by  the  client  to  obtain  the  matching  files. 

The  c-buffer  stores  in  an  encrypted  form  the  number  of  keywords  matched  by  each  matching 
file.  We  call  the  number  of  keywords  matched  for  a  file  the  c-value  of  the  file.  The  c-buffer  will 
be  used  in  reconstruction  of  the  matching  files  from  the  data  buffer  by  the  client.  As  in  the  case  of 
the  data  buffer,  the  c-buffer  stores  its  information  in  the  form  of  a  system  of  linear  equations.  The 
client  will  later  solve  the  system  of  linear  equations  to  reconstruct  the  c-values. 

The matching-indices buffer is an encrypted Bloom filter that keeps track of the indices of
matching files in an encrypted form. More precisely, the matching-indices buffer will be an
encrypted representation of some set of indices $\{\alpha_1, \ldots, \alpha_r\}$ where
$\{\alpha_1, \ldots, \alpha_r\} \subseteq \{1, \ldots, t\}$. Here $r$ is the number of files which
end up matching the query.

Each  of  these  buffers  begins  with  all  its  elements  initialized  to  encryptions  of  zero.  We  now 
detail  how  they  are  updated  as  each  file  is  processed. 
Processing Steps  To process the $i$th file $f_i$, the server takes the following steps.

Step 1: Compute encrypted c-value. First, the server looks up the query array entry $Q[j]$
corresponding to each word $w_j$ found in the file. The product of these entries is then computed.
Due to the homomorphic property of the Paillier cryptosystem, this product is an encryption of the
c-value of the file, i.e., the number of distinct members of $K$ found in the file. That is,

$$\prod_{w_j \in W_i} Q[j] = E\Big( \sum_{w_j \in W_i} q_j \Big) = E(c_i)$$

where $W_i$ is the set of distinct words in the $i$th file and $c_i$ is defined to be
$|K \cap W_i|$. Note in particular that $c_i \neq 0$ if and only if the file matches the query.


Step 2: Update data buffer. The server computes $E(c_i f_i)$ using the homomorphic property of the
Paillier cryptosystem.

$$E(c_i)^{f_i} = E(c_i f_i) = \begin{cases} E(c_i f_i) & \text{if } f_i \text{ matches the query} \\ E(0) & \text{otherwise.} \end{cases}$$


The server multiplies the value $E(c_i f_i)$ into a subset of the locations in the data buffer
according to the following procedure. Let $G$ be a family of pseudo-random functions that map
$\mathbb{Z} \times \mathbb{Z}$ to $\{0, 1\}$. Randomly select $g \xleftarrow{R} G$ (this should be
done once upon initialization and the same $g$ used for all files). The algorithm multiplies
$E(c_i f_i)$ into each location $j$ in the data buffer where $g(i, j) = 1$. Suppose for example we
are updating the third location in the data buffer with the second file. Assume that the first file
was also multiplied into this location, i.e., $g(1, 3) = g(2, 3) = 1$. Each of the two files may or
may not match the query. Suppose in this example that $f_1$ matches the query, but $f_2$ does not.
Before processing $f_2$ we have that $D(F[3]) = c_1 f_1$. After multiplying in $E(c_2 f_2)$,
$D(F[3]) = c_1 f_1 + c_2 f_2$. But $c_2 = 0$ since $f_2$ does not match, so it is still the case
that $D(F[3]) = c_1 f_1$ and the data buffer is effectively unmodified. This mechanism allows the
data buffer to accumulate linear combinations of matching files while discarding all non-matching
files.
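
The effect of this update on the underlying plaintexts can be sketched without any encryption. The Python fragment below is only a plaintext view of what the decrypted data buffer would contain, i.e., slot $j$ holds $\sum_i g(i, j)\, c_i f_i$; the function name `accumulate`, the file contents, and the hand-written table standing in for $g$ are assumptions made for the example.

```python
def accumulate(files, c_values, g, num_slots):
    """Plaintext view of the data buffer: slot j holds the sum of g(i, j) * c_i * f_i."""
    buf = [0] * (num_slots + 1)               # index 0 unused; slots are 1-based
    for i, (f, c) in enumerate(zip(files, c_values), start=1):
        for j in range(1, num_slots + 1):
            if g(i, j) == 1:
                buf[j] += c * f               # a non-matching file has c = 0 and adds nothing
    return buf[1:]

# Stand-in for the pseudo-random function, chosen so that g(1, 3) = g(2, 3) = 1.
ones = {(1, 1), (1, 3), (2, 2), (2, 3)}
g = lambda i, j: 1 if (i, j) in ones else 0

files = [500, 700]       # f_1 and f_2 encoded as integers
c_values = [2, 0]        # f_1 matches two keywords, f_2 matches none
slots = accumulate(files, c_values, g, num_slots=3)
```

Here the third slot ends up holding $c_1 f_1 = 1000$ even though $f_2$ was also multiplied into it, and the second slot, which only $f_2$ touched, stays at zero.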
Step 3: Update c-buffer. The value $E(c_i)$ is multiplied into each of the locations in the c-buffer
in a similar fashion as $E(c_i f_i)$ was used to update the data buffer. In particular, the server
multiplies the value $E(c_i)$ into each location $j$ in the c-buffer where $g(i, j) = 1$.


Step 4: Update matching-indices buffer. The server then multiplies $E(c_i)$ further into a fixed
number of locations in the matching-indices buffer. This is done using essentially the standard
procedure for updating a Bloom filter. Specifically, we use $k$ hash functions $h_1, \ldots, h_k$ to
select the $k$ locations where $E(c_i)$ will be added. For optimal efficiency, the client should
select the parameter $k$ as $\lfloor (\ell_I / m) \ln 2 \rfloor$, where $m$ is the number of files
they expect to retrieve [4]. The locations of the matching-indices buffer that a matching file $i$
is multiplied into are taken to be $h_1(i), h_2(i), \ldots, h_k(i)$. Again, if $f_i$ does not match,
$c_i = 0$ so the matching-indices buffer is effectively unmodified.
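
A plaintext sketch of this Bloom filter update is given below in Python. The hash family is not specified in this section, so the SHA-256 based `bloom_positions`, the 0-based positions, and the lower bound of one hash function in `optimal_k` are assumptions for illustration; the counters stand for the plaintexts underlying the encrypted buffer $I$.

```python
import hashlib
import math

def optimal_k(l_i: int, m: int) -> int:
    """Standard Bloom filter choice k = floor((l_I / m) * ln 2), with at least one hash."""
    return max(1, math.floor(l_i / m * math.log(2)))

def bloom_positions(i: int, k: int, l_i: int):
    """Hypothetical hash family h_1..h_k: SHA-256 of 'j:i', reduced mod l_I."""
    return [int.from_bytes(hashlib.sha256(f"{j}:{i}".encode()).digest(), "big") % l_i
            for j in range(1, k + 1)]

def update(counters, i, c_i, k):
    """Add the c-value of file i at each of its k positions (plaintext view of I)."""
    for pos in bloom_positions(i, k, len(counters)):
        counters[pos] += c_i

k_example = optimal_k(100, 10)
counters = [0] * 64
update(counters, 5, 0, 4)     # non-matching file: c = 0, nothing changes
unchanged = list(counters)
update(counters, 7, 2, 4)     # matching file with c-value 2
```

After the second call every position selected for file 7 is non-zero, which is the property the client relies on when reconstructing the candidate indices.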


After  completing  the  aforementioned  steps  for  a  fixed  number  of  files  t  in  its  stream,  the  server 
sends  its  three  buffers  back  to  the  client.  Also,  the  server  should  return  the  function  g. 
Algorithm: FileReconstruction

Input: F, C, I, k.
Output: The matching files f_{α'_1}, f_{α'_2}, ..., f_{α'_r}.

Decrypt each element of F, C, and I to obtain F', C', and I'.

β := 0
for i := 1, 2, ..., t :
    for j := 1, 2, ..., k :
        ℓ := h_j(i) mod ℓ_I
        if I'[ℓ] = 0 : next i
    β := β + 1
    α_β := i
if β > ℓ_F :
    output “Error, overflow.”, exit
while β < ℓ_F :
    β := β + 1
    α_β := pick({1, ..., t} \ {α_1, α_2, ..., α_{β-1}})

A := [ g(α_j, i) ]  for i = 1, 2, ..., ℓ_F (rows) and j = 1, 2, ..., ℓ_F (columns)
if A is singular :
    output “Error, singular matrix.”, exit
c := A^{-1} · C'
{α'_1, α'_2, ..., α'_r} := {α_1, α_2, ..., α_{ℓ_F}} \ { α_i | c_{α_i} = 0 }
for α_i ∈ { α_i | c_{α_i} = 0 } :
    c_{α_i} := 1
f := diag(c)^{-1} · A^{-1} · F'
output f_{α'_1}, f_{α'_2}, ..., f_{α'_r}

Figure 3: The algorithm for recovering the matching files after the completion of a private search.


3.3  Client’s  FileReconstruction  Procedure 

Figure 3 gives the algorithm run by the client upon completion of the private search and receipt of
the three buffers $F$, $C$, and $I$, FileReconstruction.

Step 1: Decrypt buffers. The client first decrypts the values in the three buffers using the
Paillier decryption algorithm with its private key $K_{priv}$, obtaining decrypted buffers $F'$,
$C'$, and $I'$.

Step 2: Reconstruct matching indices. For each of the indices $i \in \{1, 2, \ldots, t\}$, the client computes
$h_1(i), h_2(i), \ldots, h_k(i)$ and checks the corresponding locations in the decrypted matching-indices
buffer; if all these locations are non-zero, then $i$ is added to the list $\alpha_1, \alpha_2, \ldots, \alpha_\beta$ of potential
matching indices. Note that if $c_i \neq 0$, then $i$ will be added to this list. However, because Bloom
filters admit false positives, we may obtain some additional indices. Now we may check for
overflow, which occurs when the number of false positives plus the number of actual matches $r$
exceeds $\ell_F$. At this point, if $\beta < \ell_F$, we continue to add indices to the list until it is of length $\ell_F$.
Here the function pick denotes the operation of selecting an arbitrary member of a set. Note that
we will not run out of indices since $t > \ell_F$.
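Step 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `h` is a hypothetical stand-in for the hash functions $h_j$ (the scheme does not prescribe SHA-256), buffer positions are zero-based, and `I_dec` is assumed to already hold the decrypted matching-indices buffer I'.

```python
import hashlib

def h(j: int, i: int) -> int:
    """Hypothetical stand-in for the j-th hash function h_j applied to index i."""
    digest = hashlib.sha256(f"{j}:{i}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def candidate_indices(I_dec, t, k, ell_F):
    """Return exactly ell_F distinct indices forming a superset of the matches."""
    ell_I = len(I_dec)
    alphas = []
    for i in range(1, t + 1):
        # i is a candidate only if all k of its Bloom filter slots are non-zero
        if all(I_dec[h(j, i) % ell_I] != 0 for j in range(1, k + 1)):
            alphas.append(i)
    if len(alphas) > ell_F:
        raise OverflowError("Error, overflow.")
    # "pick": pad with arbitrary unused indices until the list has length ell_F
    chosen = set(alphas)
    for i in range(1, t + 1):
        if len(alphas) == ell_F:
            break
        if i not in chosen:
            alphas.append(i)
    return alphas
```

Here the padding simply takes the smallest unused indices; any other choice of unused indices would serve equally well, since the padded entries are expected to receive c-values of zero in Step 3.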

Step 3: Reconstruct c-values of matching files. Given our superset of the matching indices
$\{\alpha_1, \alpha_2, \ldots, \alpha_{\ell_F}\}$, the client next solves for the values of $c_{\alpha_1}, c_{\alpha_2}, \ldots, c_{\alpha_{\ell_F}}$. This is
accomplished by solving the following system of linear equations for $c$,

$$A \cdot c = C' \qquad (1)$$

where $A$ is the matrix with the $i,j$th entry set to $g(\alpha_i, j)$, $C'$ is the vector of values stored in the
decrypted c-buffer, and $c$ is the column vector $(c_{\alpha_i})_{i = 1, \ldots, \ell_F}$. Now the exact set of matching indices
$\{\alpha'_1, \alpha'_2, \ldots, \alpha'_r\}$ may be computed by checking whether $c_{\alpha_i} = 0$ for each $i \in \{1, \ldots, \ell_F\}$. Before
proceeding, we replace all zeros in the vector $c$ with ones.

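As an example of Step 3, suppose there are four spots in the decrypted c-buffer (i.e., $\ell_F = 4$),
seven files are processed, and we have established the following list of potentially matching
indices: $\{\alpha_1, \alpha_2, \alpha_3, \alpha_4\} = \{1, 3, 5, 7\}$. Then given

$$A = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}, \qquad C' = \begin{pmatrix} 2 \\ 3 \\ 1 \\ 3 \end{pmatrix}$$

we may compute

$$c_{\alpha_1} = c_1 = 1, \quad c_{\alpha_2} = c_3 = 2, \quad c_{\alpha_3} = c_5 = 1, \quad c_{\alpha_4} = c_7 = 0.$$

We then see that there were three matching files ($r = 3$): $f_1$, $f_3$, and $f_5$.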
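The worked example can be checked mechanically. The sketch below solves $A \cdot c = C'$ with a small Gauss-Jordan routine over exact rationals; this is an assumption made for readability, since in the scheme itself the arithmetic is carried out on Paillier plaintexts modulo $n$. The helper name `solve` is ours.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination over the rationals."""
    n = len(A)
    # Augmented matrix [A | b] with exact rational entries
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("Error, singular matrix.")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [row[n] for row in M]

A = [[1, 0, 1, 0],
     [1, 1, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0]]
C_dec = [2, 3, 1, 3]          # decrypted c-buffer C'
alphas = [1, 3, 5, 7]         # candidate indices from Step 2

c = solve(A, C_dec)
matches = [a for a, c_a in zip(alphas, c) if c_a != 0]
# Replace zeros by ones before Step 4 so that diag(c) is invertible
c_fixed = [c_a if c_a != 0 else Fraction(1) for c_a in c]
```

Index 7 receives a c-value of zero and is therefore identified as a Bloom filter false positive (or padding index) rather than a match.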

Step 4: Reconstruct matching files. Finally, the content of the matching files $f_{\alpha'_1}, f_{\alpha'_2}, \ldots, f_{\alpha'_r}$ may
be determined by solving the linear system

$$A \cdot \mathrm{diag}(c) \cdot f = F' \qquad (2)$$

where $f$ is the column vector $(f_{\alpha_i})_{i = 1, \ldots, \ell_F}$ and $\mathrm{diag}(c)$ is the diagonal matrix whose diagonal
entries are the elements of $c$.

We directly compute $f = \mathrm{diag}(c)^{-1} \cdot A^{-1} \cdot F'$. Note that $\mathrm{diag}(c)$ is never singular because
we previously ensured that no zeros appear in $c$. The content of the matching files appears as
$f_{\alpha'_1}, f_{\alpha'_2}, \ldots, f_{\alpha'_r}$; the other entries in $f$ will be zero. Continuing the example above (and making
up a value of $F'$), this corresponds to solving the following equations

$$\begin{aligned} f_1 + f_5 &= 32 \\ f_1 + 2 f_3 + f_7 &= 32 \\ f_1 + f_7 &= 10 \\ 2 f_3 + f_5 &= 44, \end{aligned}$$

thereby determining that $f_1 = 10$, $f_3 = 11$, and $f_5 = 22$ (and $f_7 = 0$, but this value is ignored).

4 The possibility of the matrix A being singular is considered in the next section.
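As a sanity check on this example, the following sketch multiplies out $A \cdot \mathrm{diag}(c) \cdot f$ for the claimed solution and compares it with the made-up buffer $F'$. It assumes single-block files and ordinary integer arithmetic instead of arithmetic modulo $n$; the function name `data_buffer` is ours.

```python
A = [[1, 0, 1, 0],
     [1, 1, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0]]
c = [1, 2, 1, 1]        # c_1, c_3, c_5, c_7 after the zero was replaced by one
f = [10, 11, 22, 0]     # claimed solution f_1, f_3, f_5, f_7

def data_buffer(A, c, f):
    """Compute A * diag(c) * f, i.e. the values the decrypted data buffer should hold."""
    return [sum(a * c_i * f_i for a, c_i, f_i in zip(row, c, f)) for row in A]

F_dec = data_buffer(A, c, f)
```

Each row of the result corresponds to one of the four equations above, so agreement with the right-hand sides 32, 32, 10, and 44 confirms the stated values of $f_1$, $f_3$, $f_5$, and $f_7$.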


4  Analysis 

4.1  Correctness  and  Complexity 

In this section, we give the correctness and complexity analysis of our scheme. In particular, we
will show that given a desired success probability bound $1 - \epsilon$, if the number of matching documents
is at most $m$, then by using communication and storage overhead $O(m \log(t/m))$, our scheme will
enable the user to correctly reconstruct all the matching documents from a stream of $t$ documents
with probability at least $1 - \epsilon$.

To carry out this analysis, we first identify the cases in which the user fails to reconstruct
the matching documents. From the reconstruction procedure, we can see that the client fails to
reconstruct the matching files when the two systems of linear equations $A \cdot c = C'$ (Eq. 1) and
$A \cdot \mathrm{diag}(c) \cdot f = F'$ (Eq. 2) cannot be correctly solved. This failure only happens in two cases:

1. The matrix $A$ is singular. In this case, we will not be able to compute $A^{-1}$ and solve the
system of linear equations.

2. There are more than $\ell_F - r$ false positives when the set of matching indices is computed using
the Bloom filter. In particular, if in Step 2 of the FileReconstruction procedure the
number of matching indices $\beta$ reconstructed from the Bloom filter $I'$ is greater than $\ell_F$, then
we have more variables than linear equations and thus we will not be able to
solve the system of linear equations $A \cdot c = C'$.

We show below that by picking the parameters $\ell_F$ and $\ell_I$ correctly, we can guarantee that the
probability of the above two failure cases is bounded below $\epsilon$. We demonstrate this by
proving the following three lemmas.

Lemma 1. For a given $0 < \epsilon < 1$, there exists $n = o(\log(1/\epsilon))$, such that for any $n' > n$, an
$n' \times n'$ random (0,1)-matrix is singular with probability at most $\epsilon$.

Proof. Note that an $n \times n$ random (0,1)-matrix is singular with negligible probability in $n$. This
was first conjectured by Erdős and proven in the 1960s by J. Komlós [13]. The specific bound has
since been improved several times, recently reaching $O((\frac{3}{4} + o(1))^n)$ [12,22,23]. Thus, it is easy
to see that the above lemma holds. □

Lemma 2. Let $G : \mathcal{K}_g \times \mathbb{Z} \times \mathbb{Z} \to \{0,1\}$ be a $(\omega_t, \omega_q, \epsilon/8)$-secure pseudo-random function family.
Let $g = G_k$, where $k \xleftarrow{R} \mathcal{K}_g$. Let $\ell_F = o(\log(1/\epsilon))$ such that an $\ell_F \times \ell_F$ random (0,1)-matrix is
singular with probability at most $\epsilon/4$. Then the matrix

$$A = [g(\alpha_i, j)]_{i = 1, \ldots, \ell_F;\ j = 1, \ldots, \ell_F}$$

is singular with probability at most $\epsilon/2$.

Intuitively, this lemma bounds the probability that the matrix $A$ is singular. We provide
the proof in Appendix B. Additionally, we note that for a given constant $\epsilon$, $\ell_F$ will
be linear in $m$.

Lemma 3. Given $\ell_F > m + 8\ln(2/\epsilon)$, let $\ell_I = O(m \log(t/m))$, and assume the number of matching
files is at most $m$ out of a stream of $t$. Then the probability that the number of reconstructed
matching indices $\beta$ is greater than $\ell_F$ is at most $\epsilon/2$.

Given the false positive rate of a Bloom filter, the proof is straightforward; we provide it in
Appendix C. Together, Lemma 2 and Lemma 3 provide the primary result:

Theorem 1. If $\ell_F = o(\log(1/\epsilon)) + O(m)$, $\ell_F > m + 8\ln(2/\epsilon)$, $\ell_I = O(m \log(t/m))$, and
$G : \mathcal{K}_g \times \mathbb{Z} \times \mathbb{Z} \to \{0,1\}$ is a $(\omega_t, \omega_q, \epsilon/8)$-secure pseudo-random function family, then when the
number of matching files is at most $m$ in a stream of $t$, our scheme guarantees that the client can
correctly reconstruct all matching files with probability at least $1 - \epsilon$.

Proof. By Lemma 2, the probability that the matrix $A$ is singular is at most $\epsilon/2$. By Lemma 3,
the probability that the reconstruction of the matching indices will yield more than $\ell_F$ matching
indices is at most $\epsilon/2$. Since these are the only two failure cases as explained earlier, a union
bound shows that the total failure probability, i.e., the probability that the client fails to
reconstruct the matching files, is at most $\epsilon$. □

4.2  Security 

The  security  of  the  proposed  system  according  to  Definition  1  is  straightforward.  Intuitively,  since 
the  server  is  only  provided  with  an  array  of  encryptions  of  ones  and  zeros,  the  scheme  should  be 
as  secure  as  the  underlying  cryptosystem. 

Theorem 2. If the Paillier cryptosystem is semantically secure, then the proposed private searching
scheme is semantically secure according to Definition 1.

In  Appendix  D  we  provide  a  proof.  The  proof  is  straightforward  and  proceeds  as  in  the  case 
of  Ostrovsky  and  Skeith.  Note  that  this  establishes  security  based  on  the  decisional  composite 
residuosity assumption, since that was used to prove the security of the Paillier cryptosystem.

5 Extensions


Here  we  describe  several  extensions  to  the  proposed  system  which  provide  additional  features  or 
vary  performance  tradeoffs. 

5.1  Bloom  Filter  Space  Saving 

For security it will generally be necessary to use a modulus $n$ of at least 1024 bits (e.g., as required
by the standards ANSI X9.30, X9.31, X9.42, and X9.44 and FIPS 186-2) [20]. The fact that the
c-values will never approach $2^{1024}$ reveals that the Bloom filter $I$ is in fact mostly wasted space. A
simple technique can be used to reclaim some of this space. If we assume that the sums of c-values
appearing in each location in $I$ will be less than $2^{16}$, for example, we may use each group element
to represent $|n|/16$ array entries, where $|n|$ is the bit length of $n$. In the case of $|n| = 1024$, this
reduces the size of $I$ by a factor of 64.
When we need to multiply a value $E(c)$ into the Bloom filter in the StreamSearch algorithm,
we use the following technique. To multiply it into the $i$th location in $I$, we let $i_1 = \lfloor i/64 \rfloor$ and
$i_2 = i \bmod 64$. Then we compute

$$I[i_1] := I[i_1] \cdot E(c)^{2^{16 i_2}}$$

which has the result of shifting $c$ into the $i_2$th 16-bit block within the group element in $I[i_1]$. After
the client decrypts $I$, they may simply break up each element into 64 regions of 16 bits. This
space savings comes at an additional computation cost, however. The server will need to perform
$k$ additional modular exponentiations for each file it processes.
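The effect of this packing on the plaintexts can be illustrated with a short sketch. It works on ordinary Python integers rather than Paillier ciphertexts (an assumption made purely for illustration): raising $E(c)$ to the power $2^{16 i_2}$ multiplies the plaintext by $2^{16 i_2}$, and multiplying ciphertexts adds plaintexts, so the update becomes a shifted addition. Indices are zero-based, reduction modulo $n$ is ignored, and the function names are ours.

```python
BLOCK_BITS = 16
BLOCKS_PER_ELEMENT = 1024 // BLOCK_BITS   # 64 counters per 1024-bit plaintext

def add_to_packed(I_plain, i, c):
    """Plaintext analogue of I[i1] := I[i1] * E(c)^(2^(16*i2))."""
    i1, i2 = divmod(i, BLOCKS_PER_ELEMENT)
    I_plain[i1] += c << (BLOCK_BITS * i2)

def unpack(I_plain):
    """Split every decrypted element into its 64 regions of 16 bits."""
    mask = (1 << BLOCK_BITS) - 1
    return [(elem >> (BLOCK_BITS * b)) & mask
            for elem in I_plain
            for b in range(BLOCKS_PER_ELEMENT)]
```

The unpacked counters are only meaningful under the stated assumption that no counter reaches $2^{16}$; a larger sum would carry into the neighbouring block.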

5.2  Hashing  Keywords 

In  some  applications,  the  predetermined  set  of  possible  keywords  D  may  be  unacceptable.  Many 
of  the  strings  a  user  may  want  to  search  for  are  obscure  (e.g.,  names  of  particular  people  or  other 
proper  nouns)  and  including  them  in  D  would  already  reveal  too  much  information.  Since  the  size 
of encrypted queries is proportional to $|D|$, it may not be feasible to fill $D$ with, say, every person’s
name,  much  less  all  proper  nouns. 

In such applications an alternative form of encrypted query may be used. Eliminating $D$, we
allow $K$ to be any finite subset of $\Sigma^*$, where $\Sigma$ is some alphabet. Now in QueryConstruction,
we pick a length $\ell_Q$ for the array $Q$ and initialize each element to $E(0)$. Then for each $w \in K$, we
use a hash function $h : \Sigma^* \to \{1, \ldots, \ell_Q\}$ to select a location $h(w)$ in $Q$ and set $Q[h(w)] := E(1)$.
As before we rerandomize each encryption. To process the $i$th file in StreamSearch, the server
may now compute $E(c_i) = \prod_{w \in W_i} Q[h(w)]$. The rest of the scheme is unmodified. Using this
extension, it is possible for a file to spuriously match the query if there is some word $w' \in W_i$
such that $h(w') = h(w)$ for some $w \in K$. The possibility of such false positives is the key
disadvantage of this approach.
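A plaintext-level sketch of the hashed query may help to make this concrete. Encryption is omitted entirely, so `Q` holds the values 0 and 1 that would be encrypted in the real scheme, and the product of ciphertexts becomes a sum of plaintexts. The hash function based on SHA-256 and the zero-based slot numbering are assumptions made for illustration only.

```python
import hashlib

def h(word: str, ell_Q: int) -> int:
    """Hypothetical hash function mapping a string to a slot of Q (zero-based)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % ell_Q

def build_query(K, ell_Q):
    """Plaintext of the query array: 1 at h(w) for every keyword w, 0 elsewhere."""
    Q = [0] * ell_Q
    for w in K:
        Q[h(w, ell_Q)] = 1
    return Q

def match_count(Q, W_i):
    """Plaintext analogue of E(c_i) = prod_{w in W_i} Q[h(w)]."""
    return sum(Q[h(w, len(Q))] for w in W_i)
```

For instance, with `Q = build_query({"paillier"}, 1024)`, a file whose word set contains "paillier" yields a non-zero count, while a file containing only unrelated words yields zero unless one of its words happens to collide with a keyword under `h`.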

An advantage of this alternative approach, however, is that it is possible to extend the types of
possible queries. Previously only disjunctions of keywords in $D$ were allowed, but in this case a
limited sort of conjunction of strings may be achieved. To support queries of the form “$w_1\ w_2$”
where $w_1, w_2 \in \Sigma^*$, we change the way each $W_i$ is derived from the corresponding file $f_i$. In
addition to including each word found in the file $f_i$, we include all adjacent pairs of words in $W_i$
(note that this approximately doubles the size of $W_i$). It is easy to imagine further extensions along
these lines. In particular, it is possible to match against binary data by simply including blocks of
the contents of $f_i$ in $W_i$.
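Deriving $W_i$ with adjacent word pairs is straightforward. The sketch below assumes whitespace tokenization and joins each pair with a single space; both choices are ours and are not fixed by the scheme.

```python
def word_set(text: str):
    """Return the words of a file together with all adjacent word pairs."""
    words = text.split()
    pairs = {f"{a} {b}" for a, b in zip(words, words[1:])}
    return set(words) | pairs
```

A two-word query string such as "stream searching" is then hashed as a single element, exactly like an ordinary keyword.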

5.3  Stream  Length  Independence 

In  applications  where  the  expected  number  of  matching  documents  is  fixed  and  independent  of 
the  stream  length,  a  modification  to  the  scheme  allows  communication  and  storage  independent  of 
the  stream  length  as  well.  To  produce  this  effect,  we  abandon  the  Bloom  filter  based  construction 
used in the matching-indices buffer and instead use the Ostrovsky-Skeith construction to store the
matching  indices.  We  briefly  describe  this  technique  below;  for  details  (including  an  analysis  of 
collision  detection)  refer  to  [18]. 

Let $\ell_I = \gamma m$, where $\gamma$ is selected based on the desired error bound $\epsilon$. Fix a set of hash functions
$h_1, h_2, \ldots, h_\gamma$. Also, let each entry in the matching-indices buffer $I$ be a pair of ciphertexts in $\mathbb{Z}^*_{n^2}$
rather than a single ciphertext. To update $I$ when processing the $i$th file in StreamSearch,
compute the following.

    for $j := 1, 2, \ldots, \gamma$ :
        $\ell := h_j(i) \bmod \ell_I$
        $I[\ell][1] := I[\ell][1] \cdot c \bmod n^2$
        $I[\ell][2] := I[\ell][2] \cdot c^i \bmod n^2$

To recover the set of matching indices in FileReconstruction, the client decrypts each pair
of entries in $I$. When a pair $I'[k][1]$ and $I'[k][2]$, $k \in \{1, \ldots, \ell_I\}$, is non-zero (and not a collision),
the client may recover the index of a matching file as $i = I'[k][2] / I'[k][1]$.
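The index recovery can be illustrated on plaintexts. In the sketch below, multiplying a ciphertext by $c$ is modelled as adding $c_i$ and multiplying by $c^i$ as adding $i \cdot c_i$, which is how the corresponding Paillier plaintexts behave. The slots are passed in explicitly instead of being derived from hash functions, and the collision detection of [18] is deliberately omitted, so the sketch is only valid when no two matching files share a slot.

```python
def update(I_plain, i, c_i, slots):
    """Plaintext analogue of the buffer update for the i-th file."""
    for s in slots:
        I_plain[s][0] += c_i        # I[l][1] := I[l][1] * c
        I_plain[s][1] += i * c_i    # I[l][2] := I[l][2] * c^i

def recover(I_plain):
    """Recover matching indices as I'[k][2] / I'[k][1] for non-zero entries."""
    found = set()
    for first, second in I_plain:
        if first != 0 and second % first == 0:
            found.add(second // first)
    return found
```

A non-matching file has $c_i = 0$ and leaves the buffer unchanged, so only matching files contribute entries from which an index can be read off.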

When using this technique, the c-buffer is omitted. We may set $\ell_F = m$; otherwise, the data
buffer is used as before. All parameters are now selected based only on $m$ and $\epsilon$ without regard
to $t$, and there are no false positives for streams of any length. The analysis in [18] demonstrates
that the probability of an overflow in the new matching-indices buffer may be bounded below $\epsilon$
with $\gamma = O(\log m + \log(1/\epsilon))$, producing an overall communication and storage complexity of
$O(m \log m)$. Note that our scheme still produces a constant factor improvement over the original
scheme  of  Ostrovsky  and  Skeith  in  this  case.  If  each  file  requires  s  plaintext  blocks  (i.e.,  is  of 
length  ns  bits),  then  we  reduce  communication  and  storage  by  a  factor  of  approximately  s.  This 
is  accomplished  by  retrieving  the  bulk  of  the  content  through  the  efficient  data  buffer  and  only 
retrieving  document  indices  through  the  less  efficient  matching-indices  buffer. 

5.4  Arbitrary  Length  Files 

In  applications  where  the  files  are  expected  to  vary  significantly  in  length,  an  unacceptable  amount 
of space may be wasted by setting an upper bound on the length of the files and padding smaller
files to that length. Here we describe a modification to the scheme which eliminates this source of
inefficiency  by  storing  each  block  of  a  file  separately. 

In this extension QueryConstruction takes two upper bounds on the matching content.
We let $m_1$ be an upper bound on the number of matching files and $m_2$ be an upper bound on the
total length of the matching files, expressed in units of Paillier plaintext blocks. As before, the
c-buffer is of length $O(m_1)$ and the matching-indices buffer is of length $O(m_1 \log(t/m_1))$ (or,
using the alternative construction given in Section 5.3, $O(m_1 \log m_1)$). The data buffer is now set
to length $O(m_2)$, and each entry in the data buffer is now a single ciphertext rather than an array
fixed to an upper bound on the length of each file. We introduce a new buffer on the server called
the length buffer, which is an array $L$ set to length $O(m_1)$. Intuitively, the length buffer will be
used to store the length of each matching file, and the data buffer will now be used to store linear
combinations of individual blocks from each file rather than entire files.

We briefly describe how this is accomplished in more concrete terms. Replace the corresponding
portion of StreamSearch with the following, where $\ell_C = O(m_1)$ is the length of the c-buffer
and length buffer, $\ell_F = O(m_2)$ is the length of the data buffer, $g : \mathbb{Z}^3 \to \{0,1\}$ is an
additional pseudo-random function, $d_i$ is the length of the $i$th file in the stream, and the $d_i$ blocks
of the file are denoted $f_{i,1}, f_{i,2}, \ldots, f_{i,d_i}$.

    $e := c^{d_i} \bmod n^2$
    for $j := 1, 2, \ldots, \ell_C$ :
        if $g(i, j) = 1$ :
            $C[j] := C[j] \cdot c \bmod n^2$
            $L[j] := L[j] \cdot e \bmod n^2$
    for $j_1 := 1, 2, \ldots, d_i$ :
        $e := c^{f_{i,j_1}} \bmod n^2$
        for $j_2 := 1, 2, \ldots, \ell_F$ :
            if $g(i, j_1, j_2) = 1$ :
                $F[j_2] := F[j_2] \cdot e \bmod n^2$
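The update above can be simulated on plaintexts to see what each buffer accumulates. In the sketch below, multiplication of ciphertexts is replaced by addition of plaintexts and exponentiation by multiplication, which is the effect these operations have under Paillier encryption. The function `prf_bit` is a hypothetical stand-in for both pseudo-random functions $g$ (it is keyed SHA-256 here, which the scheme does not prescribe), and all indices are zero-based.

```python
import hashlib

def prf_bit(key: bytes, *args: int) -> int:
    """Hypothetical stand-in for g: one pseudo-random bit per input tuple."""
    data = key + b"|" + ",".join(map(str, args)).encode()
    return hashlib.sha256(data).digest()[0] & 1

def process_file(C, L, F, key, i, c_i, blocks):
    """Plaintext analogue of the modified StreamSearch step for the i-th file."""
    d_i = len(blocks)
    for j in range(len(C)):
        if prf_bit(key, i, j) == 1:
            C[j] += c_i                # C[j] := C[j] * c
            L[j] += c_i * d_i          # L[j] := L[j] * c^(d_i)
    for j1, block in enumerate(blocks):
        for j2 in range(len(F)):
            if prf_bit(key, i, j1, j2) == 1:
                F[j2] += c_i * block   # F[j2] := F[j2] * c^(f_{i,j1})
```

Because every contribution is scaled by $c_i$, a file that matches no keyword ($c_i = 0$) leaves all three buffers unchanged, while each block of a matching file is spread over a pseudo-random subset of the data buffer.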

The client may use a modified version of FileReconstruction to recover the matching files.
As before, the matching-indices buffer $I$ is used to determine a superset of the indices of matching
files, and a matrix $A$ of size $\ell_C \times \ell_C$ is constructed based on these indices using $g$. The vector $c$ is
again computed as $c := A^{-1} \cdot C'$. The client next computes the lengths of the matching files as
$d := \mathrm{diag}(c)^{-1} \cdot A^{-1} \cdot L'$. If $\sum_i d_i > \ell_F$, the combined length of the files is greater than the
prescribed upper bound and the client aborts. Otherwise, the data buffer now stores a system of
$\ell_F > m_2$ linear equations in terms of the individual blocks of the matching files. Briefly, the
blocks may be recovered by constructing a new matrix $A$, filling its entries by evaluating $g$ over
the indices of the blocks of the matching files. The blocks of the matching files are then computed
as $f := \mathrm{diag}(c')^{-1} \cdot A^{-1} \cdot F'$, where $c'$ is as $c$ but with the $i$th entry repeated $d_i$ times.

Using  this  extension,  space  may  be  saved  if  the  matching  files  are  expected  to  vary  in  size. 
Some  information  about  the  number  expected  to  match  and  their  total  size  is  still  needed  to  set  up 
the query, but the available space may now be distributed arbitrarily amongst the files.

6 Conclusion


The primary contribution of our scheme is the improvement of server storage and server to client
communication complexity from $O(m \log m)$ in the size of the matching files to $O(m \log(t/m))$.
In the common streaming case of each document matching independently from other documents,
this results in the optimal $O(m)$ complexity, with near optimal constant factors. A practical
analysis with problem parameters corresponding to a realistic application is given in [2], an extended
abstract  on  the  performance  of  this  scheme.  It  is  shown  that  in  a  typical  scenario  with  a  long 
stream,  it  is  possible  to  avoid  failure  with  probability  over  0.99  while  using  communication  (and 
server  storage)  1.2m,  where  m  is  the  actual  size  of  the  matching  files,  before  the  factor  of  two 
inflation  due  to  the  Paillier  cryptosystem.  In  contrast,  we  found  the  scheme  of  Ostrovsky  and 
Skeith  to  result  in  storage  and  communication  as  high  as  24m  before  the  inflation  due  to  Paillier. 
In applications where $m$ is allowed to vary arbitrarily, independent of $t$, a modified version of our
scheme returns to the $O(m \log m)$ communication and storage complexity. In this case constant
factor improvements are made over the previous scheme of Ostrovsky and Skeith. Both versions
of our scheme achieve the increased efficiency through a novel technique for efficiently spreading
the matching documents throughout the buffer of results, the former also employing a unique
encrypted Bloom filter construction. Finally, we proved correctness and security results for the
scheme and noted some extensions.

A Terms and Notation


For  easy  reference,  we  provide  a  single  list  of  the  terms  and  variables  introduced  and  defined 
throughout  the  text. 

client  the person or machine conducting a private search, i.e., generating a private query and
eventually recovering the content that matched the query

server  the person or machine carrying out the private search on behalf of the client

$n$  Paillier public key ($n = p_1 p_2$, where $p_1$ and $p_2$ are large, secret primes)

$s$  an upper bound on the length of a file as a number of elements from $\mathbb{Z}_n$, i.e., if files are at most
$b$ bits, then $s = \lceil b / \lfloor \log_2 n \rfloor \rceil$

$t$  number of files processed by the server before returning buffers to the client

$p$  false positive rate of the Bloom filter $I$

$D$  global dictionary of potential keywords

$K$  the set of keywords forming the query

$w_i$  the $i$th word in $D$

$q_i$  the $i$th entry in the query array (before encryption), corresponds to $w_i$

$f_i$  the $i$th file checked by the server

$W_i$  the words present in or associated with the $i$th file⁵

$c_i$  the number of distinct keywords matched by the $i$th file, i.e., $|K \cap W_i|$

$m$  an upper bound on the number of files which may be retrieved

$r$  the number of files which actually match the query

$Q$  the encrypted query, an array of $|D|$ elements from $\mathbb{Z}^*_{n^2}$

$F$  the data buffer, an array of $\ell_F$ elements, each of which is an array of $s$ elements from $\mathbb{Z}^*_{n^2}$

$C$  the coefficients buffer, an array of $\ell_F$ elements from $\mathbb{Z}^*_{n^2}$

$I$  the matching indices buffer, an array of $\ell_I$ elements from $\mathbb{Z}^*_{n^2}$

$k$  the number of hash functions to be used with the matching indices buffer, set to $\lfloor \frac{\ell_I}{m} \log 2 \rfloor$

5 In the case of text documents, this is essentially the file itself; in the case of binary files, this set of words may be
metadata bundled with the file (e.g., the ID3 tag of an MP3 file).

B Proof of Lemma 2


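Lemma 2. Let $G : \mathcal{K}_g \times \mathbb{Z} \times \mathbb{Z} \to \{0,1\}$ be a $(\omega_t, \omega_q, \epsilon/8)$-secure pseudo-random function family.
Let $g = G_k$, where $k \xleftarrow{R} \mathcal{K}_g$. Let $\ell_F = o(\log(1/\epsilon))$ such that an $\ell_F \times \ell_F$ random (0,1)-matrix is
singular with probability at most $\epsilon/4$. Then the matrix

$$A = [g(\alpha_i, j)]_{i = 1, \ldots, \ell_F;\ j = 1, \ldots, \ell_F}$$

is singular with probability at most $\epsilon/2$.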

Proof. We know that an $\ell_F \times \ell_F$ random (0,1)-matrix is singular with probability at most $\epsilon/4$.
However, in our scheme, $A$ is not a random matrix, but a matrix constructed using the
pseudo-random function $g$. Thus, we need an additional proof step to show that the matrix $A$
constructed using the pseudo-random function $g$ is also non-singular with overwhelming
probability, since otherwise we could break the pseudo-random function. This proof step is
as follows.

Assume for contradiction that the matrix $A$ is singular with probability greater than $\epsilon/2$.
Then we show that we can construct an adversary $B$ with $\mathrm{Adv}_B > \epsilon/8$ using a polynomial number of
queries and polynomial time, thus contradicting the original assumptions on $G$.

To do so, we play the following game. We flip a fair coin $\theta \in \{0,1\}$, and
the adversary $B$ is given one of two worlds in which he can make a number of queries to a given
oracle. If $\theta = 1$, $B$ is given world one, where $g = G_k$, $k \xleftarrow{R} \mathcal{K}_g$, and the oracle responds to a
query $(i, j)$ with $g(i, j)$. If $\theta = 0$, the adversary $B$ is given world two, where the oracle responds
to a query $(i, j)$ according to a random function $R$ mapping $(i, j)$ to $\{0,1\}$, i.e., by flipping a fair coin
$b \in \{0,1\}$ and returning $b$ (using a table of previous queries to
ensure consistency). After a series of queries, the adversary $B$ guesses which world he is in. The
adversary $B$ makes his guess using the following strategy: First, the adversary $B$ constructs a
matrix $A$ by querying the oracle for all $(i, j)$ where $i \in \{1, \ldots, \ell_F\}$ and $j \in \{1, \ldots, \ell_F\}$; then the
adversary $B$ checks if $A$ is singular. If yes, he guesses that he is in world one. If not, he guesses
that he is in world two.

Thus, we can compute the advantage of such an adversary $B$.

$$\mathrm{Adv}_B = \left| \Pr[B^g = 1] - \Pr[B^R = 1] \right| = \left| \tfrac{1}{2} \Pr[A \text{ is singular} \mid \theta = 1] - \tfrac{1}{2} \Pr[A \text{ is singular} \mid \theta = 0] \right|.$$

From the above assumptions, $\Pr[A \text{ is singular} \mid \theta = 1] > \epsilon/2$ and $\Pr[A \text{ is singular} \mid \theta = 0] \le \epsilon/4$,
thus $\mathrm{Adv}_B > \epsilon/8$, contradicting the original assumptions on $G$. □




C  Proof  of  Lemma  3 


Lemma 3. Given $\ell_F > m + 8\ln(2/\epsilon)$, let $\ell_I = O(m \log(t/m))$, and assume the number of
matching files is at most $m$. Then the probability that the number of reconstructed matching indices $\beta$ is
greater than $\ell_F$ is at most $\epsilon/2$.

Proof. The number of reconstructed matching indices $\beta$ equals the number of truly matching
files plus the number of false positives from the reconstruction using the Bloom filter. Thus, we
need to bound this number of false positives to be at most $\ell_F - m$.

The false positive rate $p$ of the Bloom filter storing $m$ entries is as follows [4].

$$p = \left(\frac{1}{2}\right)^{\frac{\ell_I \log 2}{m}} \qquad (3)$$

Thus, the expectation of the number of false positives is $pt$. For simplicity, let us set $pt =
(\ell_F - m)/2$. Thus $\ell_I = m (\log 2)^{-2} \log(2t/(\ell_F - m))$. Since $\ell_F$ is set to be linear in $m$, with $\ell_I =
O(m \log(t/m))$ the expected number of false positives can be bounded far from $\ell_F$.

Moreover, we can model the number of false positives as a binomial random variable $X$ (a sum of
Bernoulli trials with rate parameter $p$) and approximate it with a Gaussian centered at the expected
number of false positives. From Chernoff bounds, we can derive that $\Pr[X > \ell_F - m] < \exp(-(\ell_F - m)/8)$.
Thus, with $\ell_F > m + 8\ln(2/\epsilon)$, this probability is bounded by $\epsilon/2$, and the above
lemma holds. □
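To get a feel for the sizes involved, the bounds used in this proof can be evaluated numerically. The helper names below are ours, and the sketch simply evaluates the expressions stated above (taking the logarithm in Equation 3 to be the natural logarithm, which is an assumption); it is not a tuned parameter selection procedure.

```python
import math

def min_data_buffer_length(m: int, eps: float) -> int:
    """Smallest integer ell_F strictly greater than m + 8 ln(2/eps)."""
    return m + math.floor(8 * math.log(2 / eps)) + 1

def overflow_bound(ell_F: int, m: int) -> float:
    """The bound exp(-(ell_F - m)/8) on Pr[X > ell_F - m] quoted in the proof."""
    return math.exp(-(ell_F - m) / 8)

def bloom_false_positive_rate(ell_I: int, m: int) -> float:
    """Equation (3): p = (1/2)^(ell_I * log 2 / m)."""
    return 0.5 ** (ell_I * math.log(2) / m)
```

For example, with $m = 100$ and $\epsilon = 0.01$ we have $8\ln(2/\epsilon) \approx 42.4$, so `min_data_buffer_length(100, 0.01)` gives 143, and the corresponding bound $\exp(-43/8) \approx 0.0046$ is indeed below $\epsilon/2 = 0.005$.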
D Proof of Theorem 2


Here we provide a proof of the semantic security of the proposed private searching system assuming
the semantic security of the Paillier cryptosystem. The proof is simple; in fact it proceeds in
the same way as the proof of semantic security in Ostrovsky and Skeith’s scheme [18]. The same
proof applies whether we are using encrypted queries of the original form proposed by Ostrovsky
and Skeith or the hash table queries we propose as an extension.

Theorem 2. If the Paillier cryptosystem is semantically secure, then the proposed private searching
scheme is semantically secure according to Definition 1.

Proof. We assume there is an adversary $\mathcal{A}$ that can play the game described in Definition 1 with
non-negligible advantage $\epsilon$ in order to show that we then have non-negligible advantage in breaking
the security of the Paillier cryptosystem.

First we initiate a game with the Paillier challenger, receiving public key $n$. We choose plaintexts
$m_0, m_1 \in \mathbb{Z}_n$ to be simply $m_0 = 0$ and $m_1 = 1$. We return them to the Paillier challenger
who secretly flips a coin $\beta_1$ and sends us $E(m_{\beta_1})$.

Now we initiate a game with $\mathcal{A}$ and send them the modulus $n$, challenging them to break the
semantic security of the private searching system. They send us two sets of keywords, $K_0$ and $K_1$.
We flip a coin $\beta_2$ and construct the query $Q_{\beta_2}$ by passing $K_{\beta_2}$ to QueryConstruction. Next
we replace all the entries in $Q_{\beta_2}$ which are encryptions of one with $E(m_{\beta_1})$, re-randomizing each
time by multiplying by a new encryption of zero. Note that with probability one half, $\beta_1 = 0$ and
$Q_{\beta_2}$ is a query that searches for nothing. In this case $\beta_2$ has no influence on $Q_{\beta_2}$ since $Q_{\beta_2}$ consists
solely of uniformly distributed encryptions of zero. Otherwise, $Q_{\beta_2}$ searches for $K_{\beta_2}$.

Next we give $Q_{\beta_2}$ to $\mathcal{A}$. After investigation, $\mathcal{A}$ returns their guess $\beta'_2$. If $\beta'_2 = \beta_2$, we let the
guess for our challenge be $\beta'_1 = 1$ and return it to the Paillier challenger. Otherwise we let $\beta'_1 = 0$
and send it to the Paillier challenger.

Since $\mathcal{A}$ is able to break the semantic security of the private searching system, if $\beta_1 = 1$ the
probability that $\beta'_2 = \beta_2$ is $\frac{1}{2} + \epsilon$, where $\epsilon$ is a non-negligible function of the security parameter $n$.
If $\beta_1 = 0$, then $P(\beta'_2 = \beta_2) = \frac{1}{2}$, since $\beta_2$ was chosen uniformly at random and it had no bearing
on the choice of $\beta'_2$. Now we may compute our advantage in our game with the Paillier challenger
as follows.


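$$\begin{aligned} P(\beta'_1 = \beta_1) &= P(\beta'_1 = 1 \mid \beta_1 = 1) \cdot \tfrac{1}{2} + P(\beta'_1 = 0 \mid \beta_1 = 0) \cdot \tfrac{1}{2} \\ &= \left(\tfrac{1}{2} + \epsilon\right) \cdot \tfrac{1}{2} + \tfrac{1}{2} \cdot \tfrac{1}{2} \\ &= \tfrac{1}{2} + \tfrac{\epsilon}{2} \end{aligned}$$

Since $\epsilon$ is non-negligible, so is $\frac{\epsilon}{2}$. □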